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Abstract 

A new parallel algorithm for simulating Ising spin 
systems is presented. The sequential prototype is the 

fold way algorithm [2], which is efficient but is hard 
to parallelize using conservative methods. Our par- 
allel algorithm is optimistic. Unlike other optimistic 
algorithms, e.g., Time Warp, our algorithm is syn- 
chronous. It also belongs to the class of simulations 
known as "relaxation" hence it is named "syn- 
chronous relaxation." We derive performance guar- 
antees for this algorithm. If N is the number of PEs, 
then under weak assumptions we show that the num- 
ber of correct events processed per unit of time is, 
on average, at least of order N/\ogN. All commu- 
nication delays, processing time, and busy waits are 
taken into account. 

Key words: conservative simulation, optimistic 
simulation, computational physics, Metropolis algo- 
rithm 

1 Introduction 

The Ising model of computational physics was intro- 
duced in 1925 |7| for describing magnetization phe- 
nomena. It has since been in use both for its original 
purpose, and as a computational metaphor in areas 
ranging from economics jTTI to wireless communica- 
tions 00 to material science |T5] . 

In a simple version of the model we have a pla- 
nar lattice of atoms, each of which may be in one 
of two spin states, +1 or — 1. The atoms flip their 
spins stochastically in a way that depends only on 
the states of their nearest neighbors, on an external 
magnetic field, and on temperature. 

Ising spin simulations have always been slow. 
There have been a number of proposals for speed- 
ing them up. The fastest available serial algorithm 



that does not violate the stochastic properties of the 
procedure by Metropolis et al. ^2] is the n-fold way 
algorithm |2J. A parallel version of the n-fold way 
was proposed in [111 112) and implemented in [§| . Its 
speedup with respect to the serial n-fold way algo- 
rithm was not very high in low temperature. This 
was because at low temperature the algorithm per- 
formed many tentative calculations of flips that were 
rejected at the end. In the procedure atoms along 
the boundaries of regions hosted by different pro- 
cessing elements were subject to such rejections; only 
atoms in the interior of regions were not. As the re- 
jection rate asymptotically approaches 100%, even if 
the fraction of atoms on the boundaries is small, the 
rejection overhead becomes the dominant bottleneck 
in simulation. Note that the main advantage of the 
n-fold way over the procedure in Q] is absence of 
rejections. 

The parallel algorithms in |11U12| are conservative. 
Herein we describe a new algorithm for Ising spin sim- 
ulations whose aim is to eliminate the drawbacks as- 
sociated with rejections. The algorithm is optimistic; 
it belongs to the class of "relaxation" simulations |3] . 
It is also a synchronous algorithm, and as such, it 
differs from Time Warp [g] . Our "synchronous relax- 
ation" implements the n-fold way algorithm in par- 
allel by repeatedly generating a time segment of the 
spin flip trajectory. After each iteration, the PEs ex- 
change information they generated, and correct the 
errors thus revealed. Unlike conservative schemes in 
UHIII, computations are not wasted on rejections. 
And unlike Time Warp, erroneous events are cor- 
rected in a synchronous, iterative fashion rather than 
asynchronously. However, iterations might introduce 
extra computations. 

Since Time Warp is asynchronous, and our algo- 
rithm is not, the behavior and analyses of these algo- 
rithms are quite different. In our analysis we estimate 



the number of iterations needed to correct mistakenly 
simulated events and the amount of work in each it- 
eration. Under weak assumptions, we show that the 
former is small and the latter is not excessive. Syn- 
chronous relaxation was discussed in 0] in a different 
context, and without formal proofs. 

Note that Metropolis algorithm in ^1] simulates a 
certain Markov Chain. Markov Chain uniformization 
is put forward in 115| as a tool for parallelizing a 
sequential Markov Chain simulation. Algorithms in 
[TT1 turn out to be a specialization of more gen- 
eral schemes in [§]. It should be noted that the actual 
development of Ising spin simulations did not follow 
this logic. Specifically, the rejection scheme |2|> was 
proposed as an inherently serial algorithm, without 
any reference to parallel processing or uniformiza- 
tion. Then scheme [5] was proposed as a derivative 
of, and an improvement over |14j . since it got rid of 
event rejections. Note that according to it 
is the scheme in ^3] that would rather be consid- 
ered a derivative of [2], the derivative which is more 
amenable to parallelization. Actually, it took a while 
[111 IT2*| to break the tradition of thinking the scheme 
[T4"] as being inherently serial. Optimistic methods of 
parallelization of Markov Chain simulations are dis- 
cussed in |15| . but still in connection with uniformiza- 
tion, i.e., with event rejections. Since our present syn- 
chronous relaxation method for parallelization of the 
scheme in [5] eliminates rejections, it apparently does 
not follow from [T5J ; its efficiency analysis is certainly 
new. 



2 The Ising model 

In the model atoms are located at the vertices of a 
rectangular subset G of an orthogonal lattice. A con- 
figuration is defined by the spin variables s(v) = ±1 
for atoms v G G. In accordance with the time 
evolution of the configuration is a sequence of sin- 
gle spin updates: given a configuration, define the 
next configuration by choosing a vertex v uniformly 
at random and changing s(v) to —s(y) with indepen- 
dent probability p, computed as described in the next 
paragraph. With probability 1—p, the s(v) remains 
unchanged. 

Computing the probability p = p(v) involves the 4 
nearest neighbors of vertex v = (i + 1, j), (i — 

1, j), (i, j + — Periodic boundary conditions 

are assumed in both directions. Computing the p(v) 



also involves the Hamiltonian 

E = -J s{v)s{v>)-HY,<v) (2.1) 

v,v'GG v 

where J > and H are constants, and the pairs 
(v,v r ) of indices in the first sum are nearest neighbors. 
The value of p is given in |14| by 

p = min{l,0}, 9 = exp{-AE/k B T ahs ) (2.2) 

where AE = AE(v) is the energy change that would 
result from the flip of the spin at site v, ks is Boltz- 
mann's constant, and T a b s is the absolute tempera- 
ture. In [5] a different expression for p is used: 

p = 0/(1 + 6) (2.3) 

with the same 6. Note that computing AE(v) for 
spin v involves only the values of s(v) and s(v') for 
the nearest neighbors v' of v. According to 1)2. 2[l . 
a spin flip which results in lower energy, AE < 0, 
certainly occurs, because in such a case, p = 1; but 
in 1)2.3)1 p < 1 always and even the flips with AE < 
can sometimes be rejected. 

3 The n-fold way algorithm 

We suppose that attempts of a flip by each atom con- 
stitute an independent Poisson process 1 with fixed 
rate, say A. Then the successful attempts is also a 
Poisson process, but with a smaller rate \p, p = p(t), 
which, in general, varies in time. 

The n-fold way algorithm in [2| splits all atoms into 
n classes, where a class is defined as those atoms with 
the same flip probability p, given either by 1)2.2)1 or 
1)2.3)1 . In the 2D Ising model there will be no more 
than n = 10 classes: each atom may be either up 
or down, and its four neighbors have 5 possibilities 
(none up, one up,. . ., 4 up), so there are 2x5 possible 
configurations for each atom For each class k define 

Afe = number of atoms in class k (3-1) 
Pk — P (atom in class k flips when chosen) . (3.2) 

Given that the Poisson rate of flips of a single atom 
of class k is Xpk, the combined rate of flips of all 
atoms is 

A = \J2N kPk , (3.3) 

k 

In |5| 1141 system state changes are just a sequence. For 
the sake of parallelization we need spin flips to be a Poisson 
process in continuous time. This assumption is consistent with 
the models in 121 1141 . but is not mentioned there. 
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and the combined process whose arrivals coincide 
with spin flips at any atom also enjoys the Poisson 
property. As spins flip, membership of atoms may 
change, Nk may vary in time N/, = Nk(t), and so may 
the Poisson rate of the combined process A = A(i). 

It follows that given the time 7~i_i of the (i — l)th 
flip, the time of the ith flip can be generated as 

T4 = Tl - 1 + aT^T)' (3 ' 4) 

where Ui is the ith independent sample of a random 
(0,1) uniformly distributed variable. The first spin 
flip time t\ can be generated as in l|3.4|) with % = 1 if 
we formally set r = 0. 

We should then select the atom whose spin is to 
flip. This is done in three phases. In Phase 1 we 
generate an independent random (0,1) uniformly dis- 
tributed variable V^ In Phase 2 we choose a class 
fc* to which the atom to be flipped belongs. This 
is done by linearly scanning the sequence of classes 
k = 1,2,. while summing the weights of their 
chances to be selected 



N. 



Once the inequality 



W(k - 1) < Vi < W{k) 



(3.5) 



(3.6) 



is detected for fc = fc,, we stop. 

In Phase 3 we choose an atom in class fc*. All 
atoms in each class are maintained in a linear order. 
To determine the index of an atom in the order we 
take the normalized residual i?j of the random sample 
V;. This is defined as 



R, 



Vi - W{K - 1) 



N kt p k 



(3.7) 



where fc* is the class index found in Phase 2. The 
required index is then 

access index = [RiNk,\ + 1. (3.8) 

Note that each spin flip changes the class member- 
ship of several atoms. We omit here the discussion of 
an elaborate data structure [2] which maintains the 
classes in the computer memory so that the atoms in 
the classes are positioned in the required linear or- 
der. We note that it only takes O(l) computations 
to adjust this order at each spin flip. Experiments 
[2] confirm that at a low temperature, the n-fold way 
algorithm is much faster than the original Metropolis 
et al. algorithm with rejections in |14|. 



Synchronous relaxation: 
general formulation 
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Suppose that a simulation of a system is to be per- 
formed on a multiprocessor with N processing ele- 
ments (PEs). Our procedure is to give each PE a 
subsystem to host, and have the PE produce the his- 
tory of that subsystem. 

Each PE will keep track of the simulated time be- 
fore which all simulated histories are known; this 
quantity is called committed time. Committed time 
increases in steps, in synchrony among all PEs; its 
value is common to all PEs. Each step consists of sev- 
eral iterations. At each iteration, each PE produces 
a tentative or speculative history of the subsystem it 
hosts beyond the current committed time. The PE 
extends its local history until its local time reaches 
the committed time plus T max , where T max is the step 
size of committed time increases. Since subsystems 
are, in general, connected, in order to produce a cor- 
rect local history, a PE needs to know the correct 
local histories of other PEs. But they are not known, 
because other PEs are in the same quandary. If, by 
some miracle, a PE knew the correct histories of all 
others, then it would produce a correct history for 
itself; when all PEs produce correct histories, then 
the next committed time will become committed time 
plus T max . So all we have to explain is how to achieve 
this miracle. 

The mechanism is by iterations. During the first 
iteration, each PE makes the simplest assumption 
about the histories of the other PEs; we call this 
the canonical assumption. For example, it may as- 
sume that the states of the other subsystems do not 
change. This assumption will enable it to produce 
its own history. After all the PEs generate local his- 
tory for an additional T max units of simulated time, 
they compare their histories. This comparison is done 
in synchrony, in the sense that the PEs perform the 
comparison only after they have all completed the 
previous task of generating their histories. As a rule, 
there will be inconsistencies between the assumed and 
actual (generated) histories of other PEs. If so, the 
PEs need to perform more iterations. 

During subsequent iterations, if a PE needs to 
know the local history of another PE, it uses that 
history generated in the last iteration. The goal of 
producing correct histories at a step is achieved if no 
PE detects any inconsistencies between the assumed 
and actual history of any other PE. Once this hap- 
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pens, all PEs increase committed time by T max , and 
continue. 

We now give a representation of the synchronous 
relaxation algorithm. In the outset of the simula- 
tion, committed time is set to zero. The subsystems 
hosted by the PEs are set to their initial states. Then 
each PE executes the following code. The execution 
is asynchronous except for the two "synchronize" 
statements. While executing a "synchronize," each 
PE waits for the other PEs to reach the same 
statement. 



There are two issues that need to be addressed. 
Will there ever come a time when no inconsistencies 
are detected? Another is the question of correctness. 
Is the lack of inconsistencies equivalent to correctness 
of the history? Generally, the iterations will not ter- 
minate. However, when applied to discrete event sim- 
ulations, under mild conditions we can show that the 
iterations do terminate. Also, under weak conditions, 



we can show that lack of inconsistency is equivalent 
to correctness (see [T3]V 

The preceding algorithm turns out to be applicable 
not only to deterministic simulations, but to stochas- 
tic simulations, too, using the following technique. 
Sequences of (pseudo)random numbers are employed 
in stochastic simulations. We should treat these se- 
quences as deterministic. To this end, all we have 
to do is to reuse the (pseudo)random numbers gener- 
ated at an iteration, during the course of the following 
iterations. An extreme way of effecting this is to gen- 
erate all random numbers in a list before the rest of 
the simulation begins. This is a helpful way to think 
about simulation, whether or not it turns out to be 
a practical method in any instance. 

The general description above may not give enough 
details for a specific implementation. Furthermore, it 
may not be clear how to estimate the efficiency of the 
algorithm; specifically, how to bound the number of 
iterations in each step. The next sections will fill in 
these details for the Ising model. 

5 Synchronous relaxation for 
Ising simulation 

In this section we detail the synchronous relaxation 
algorithm for Ising simulation using the n-fold way 
algorithm. While simulations can be implemented 
correctly even if different PEs run different serial al- 
gorithms to generate local history, we are interested 
in high performance, so we will concentrate on a par- 
allclization of the most efficient serial algorithm. 

We follow the algorithm outlined in Section^] The 
set G is partitioned into subsets G,, 1 < i < N, and 
each Gi is hosted by a separate processing element, 
PEj. Each Gi corresponds to a subsystem in the gen- 
eral description of Section^ As before, the boundary 
of Gi is denoted dGi. 

Statement 1 of the general relaxation procedure re- 
quires a choice of step size T max for each committed 
time increase. In the simplest algorithm formulation, 
we choose the same step value to serve for all com- 
mitted time increase steps, 

log N 

Tmax = \\dG~y (5 ' 1} 

assuming that all the sizes of boundary regions |<9Gi| 
are equal, and where A, as in Section is the largest 
rate at which any atom can flip. 



DO (step) 

1. Choose a step size T max for committed 

time increase. 

2. Make canonical assumption about 
other PEs' histories. 

DO (iteration) 

(a) Generate local history, starting with 

committed time until 

local time > committed time -t-Tmax- 

(b) Synchronize. 

(c) Compare generated history of other PEs 

with the corresponding assumptions. 
If they differ, replace the current 
assumptions with the corresponding 
histories. 

(d) Synchronize. 

UNTIL (iteration) all PEs detect 

no difference in the comparison in step !2cl 

3. Increment committed time by T max . 

UNTIL (step) committed time exceeds 
a prespecified value. 
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Statement 2 of the general relaxation procedure re- 
quires each PE to make a canonical assumption con- 
cerning its neighbors' histories. We choose the as- 
sumption that no spin flip occurs in any neighboring 
atom during the initial committed time increase step 

1 max- 

Statement (a) requires each PE at each iteration 
to generate a segment of local history. The segment 
begins at the current committed time value, let us call 
it t c , and ends at time t c + T max . The history can be 
identified with a sequence t c < r\ < t% < . . . < r m < 
tc + T max of spin flip times and the corresponding 
sequence of atoms V\, 1)2, ■ ■ ■ v m that flip their spins 
at these times. Because the procedure is complex 
and subtle, our discussion will be lengthy. We split 
it into several cases. 

First iteration, first step. Thanks to the 
canonical assumption, the procedure executed by 
each PE in this case is almost a literal repeti- 
tion of the sequential algorithm described in Sec- 
tion As in that algorithm, a sequence of pairs 
(Ui, Vi), (U2, V2), . . . , (U m , Vm), of independent sam- 
ples of uniform (0,f) distributed random variables Ui 
and Vi, feeds the generation of the spin flip history. 
The Ui feeds formula l|3.4[l , which generates Tj , given 
Tj_i and given the previous system state. The Vi 
feeds the procedure that selects the class of the atom 
Vi to flip and selects the Wj's index according to for- 
mulas (|3.7|1 and (|3.8|) . To compute n using (|3.4|) for 
i = 1 we formally assume tq — 0. As in the sequen- 
tial case, this can be done, because the state of the 
system at time t c = is known. The computations 
continue for as long as Tj computed by (|3.4|l is smaller 
than t c +T max . The last Tj that satisfies this condition 
becomes r TO . 

Steps (b), (c), and (d), that follow are obvious. 
Note that while comparing the generated histories 
with the assumed histories, as required in step (c), the 
PE pays attention only at the single layer of atoms 
that border its region. This is due to the specific 
neighborhood structure that consists of 4 neighbors 
which we have assumed in our Ising model (see Sec- 
tion[2J) . If the geometry were different (e.g., more di- 
mensions, leading to more neighbors; or deeper pen- 
etration of influence) then we might need to check 
more atoms. 

After exchanging information about their produced 
histories in step (c), the PEs detect inconsistencies. 
Any boundary flip generated at the initial itera- 
tion causes inconsistencies in neighbors, because of 
the canonical assumption that there were no flips in 



boundaries. The communication in step (c) leads the 
PEs to have new assumptions about the histories of 
their neighbors, in place of the canonical assumption. 
Subsequent iterations, first step. After the PEs 
synchronize in step (d), they execute step (a) again. 
The first important distinction of this parallel proce- 
dure from its sequential prototype is that the feeding 
sequence (U\, Vi), (U2, V2), ■ ■ ■ of random numbers at 
the subsequent iterations has to be the same as at 
the first iteration. No new random sample Ui or Vi 
in place of the previously generated one should be 
produced. This is in keeping with the notion given 
at the end of Section0|that these numbers are viewed 
as presimulated, and need to be used in order. 

As in the first iteration, each old Ui is employed to 
generate Tj and each old Vi is employed to select atom 
Vi for the flip. Because of different boundary condi- 
tions, the resulting Tj and selected Vi will generally 
differ from those computed at the previous iteration. 

Moreover, the second main distinction between 
parallel and serial algorithms is that computing n, 
given Ti-i and given the system state at time Tj_i, 
is not as straightforward as simply applying (|3.4(l . 
The complication arises because spins in the neigh- 
borhood may flip during time interval (Tj_i,Tj), say 
at time tb, which we call a break point time, Tj_i < 
h < n. 

The class membership of some atoms hosted by the 
PE may change at time tb as a result. This in turn 
may change some Ni and hence may change the A 
computed by (|3.3|l and used in (|3.4() . The presence 
of a break point tb inside interval (Tj_i,Tj), and the 
change of A at tb, makes formula (|3.4(1 unusable as it 
is. 

A more general method adjusts Tj so as to satisfy 
the following equation 

/ K(t)dt= -log Ui. (5.2) 

J T«-l 

Solving H5.2JI for Tj is easily accomplished since A(t) 
changes in time in a piecewise constant fashion. One 
way of solving l|5.2[l is, with self-evident notation, to 
order the break points from t^o = Tj_i to the maxi- 
mum Tb^m — t c + T max , and to compute for I < i < m 
the sums 

k 

h = ^Zin,j -n,j-x) (5.3) 
= f A(i) dt (5.4) 
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If — log Ui > b m then Ti > t c + r max , and we are done. 
If — logC/j < b m , then since the bk are monotone 
increasing, we easily find k such that 

< -logt/, < b k . (5.5) 

Given fc, we have (analogously to equation QH.4[l ) 



Another, more efficient, way to solve <|5.2[1 for n, is 
to generate the bk one at a time until equation l|5.5|l 
is satisfied. 

Another distinction from sequential simulation is 
a possible need for additional random numbers. The 
need may arise as a result of change of spin flip times 
Ti at iterations. A subsequent iteration may fit more 
spin flips, or fewer spin flips, within the time inter- 
val (t c ,t c + T max ). Recall that a prcsimulated list 
will have the random numbers drawn in order, and 
so additional random numbers must be drawn after 
the original ones are used. In particular, the ran- 
dom number U m +x that caused r m+ i to exceed T max 
would be the first one used if more random numbers 
arc needed. If fewer flips are needed, we save unused 
random numbers for possible use in future iterations 
or subsequent time steps. 

Iterations of the synchronous relaxation steps a, 
b, c, and d continue until all PEs realize that the as- 
sumptions they made at the beginning of an iteration 
agree with the information they receive at the itera- 
tion's end; that is, the times Tj and flipped atoms vi 
are the same as in the previous-iteration assumption 
and this hold for all T{ < T max . This signals that it is 
time to advance the committed time to t c = T max for 
a new time step, and to continue with a new round 
of iterations if necessary. 

Subsequent steps. The only difference of this situ- 
ation from those already described is in defining ini- 
tial conditions. Whereas tq = at the first time 
advancement step and the initial system state is that 
at time 0, for the following steps, the initial state of 
the subsystem hosted by a PE is the state resulting 
from the last simulated event at the previous time 
advancement step. 

Let m be the index of the last spin flip simulated in 
the previous step. Using the notation above, the last 
event if any of the previous step was (r m , v m ). If, by 
chance, there were no events until now, then m = 0, 
T m = 0, and v m is undefined. Redefining notations 
for the current time step, the old r m becomes the 



To for this time step. This To is used in this time 
step in the same way as the To of the first time step 
was used. In particular, the t\ can be found using 
either formula l|3.4(l or by solving equation (|5.2() with 
respect to Tj, for i = 1. 

What we called Ui and Vi at the previous time step 
we now rename to U%- m and Vi- m ; so U m +\ becomes 
U\, etc. Again, from viewing random numbers as 
coming from a presimulated list, any remaining ran- 
dom samples V m+ i, U m+ 2, V m +2, must be reused 
before additional random sampling is done. They are 
to be renamed in the new time step as Vx, U%, V2, 
respectively. 

An alternate approach to subsequent steps. 

For Ising simulations, properties of Poisson random 
variables enable us to wipe clean the slate, empty 
event lists of all PEs, and start afresh at the time 
t c + 7m ax . That is, we need not keep any knowl- 
edge of tentative events generated at previous steps. 
New random numbers may be generated at this time, 
taken later from a presimulated list than any up until 
now. This procedure may not be desirable; we sim- 
ply wanted to point out that for Ising simulations it 
leads to unbiased simulations. 

6 Efficiency 

One of the nicest attributes of our algorithm is that 
it is provably efficient. In this section we outline the 
assumptions that go into the proof, and give a state- 
ment of the result. The detailed proofs can be found 
in the forthcoming paper 

The efficiency result is asymptotic as N, the num- 
ber of PEs, becomes large. Our method is to bound 
the probabilistic distribution of the number of iter- 
ations needed to simulate one step of the algorithm. 
In addition, we bound the distribution of the amount 
of work needed to simulate each iteration. We also 
bound from below the average number of events simu- 
lated in each step. These bounds require a particular 
choice of T max as a function of N, and also require 
some regularity of the model being simulated as N 
increases. They also require that the load of simu- 
lated events remains balanced among the PEs as N 
grows. 

Specifically, suppose that the temperature is 
bounded away from zero during the entire time in- 
terval being simulated, and that A is bounded and is 
bounded away from zero, so that the ratio between 
the largest and smallest Poisson rates in the system 
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remains bounded. The blocks of PEs may change size 
as N increases; we suppose that the ratio between the 
number of atoms hosted by the different PEs remains 
bounded, that the ratio of the number of boundary 
atoms between any two PEs also remains bounded, 
and that the number of neighbors of any PE remains 
bounded, all independent of N. We also suppose 
that the number of boundary atoms per PE grows 
no faster than logiV. We choose T max so that for 
each PE, the number of boundary atoms times T max 
is within a constant factor of log TV. Clearly, this is 
possible because of the preceding assumptions. This 
choice ensures that the maximum and minimum ex- 
pected number of events occurring among the bound- 
ary atoms of any PE during a time interval of length 
T max is of order logiV. 

The preceding assumptions relate to the system 
being simulated. Our efficiency result also requires 
some assumptions about the machine doing the sim- 
ulation. We assume that synchronization can be done 
in no more than order logiV time. We assume that 
communicating x events from one PE to a neighbor- 
ing PE takes no more than a constant times x+log N, 
and that this can be done in parallel, so that if x is 
the maximum number of events to be communicated 
from any PE to its neighbor, then all communica- 
tion can be done in no more than a constant times 
x + log N time. We suppose that the speed of each 
PE is within a constant factor of the speed of any 
other PE; speed is the number of operations per unit 
of time. We suppose that the unit of time is chosen so 
that the speeds of the PEs is constant as N increases. 

Now for a bit of notation. For any iteration, we 
let / represent the maximum total number of events 
that occur at the boundary atoms of a PE. That is, 
/ is the maximum number of events that need to 
be communicated at each iteration. We let g denote 
the number of iterations to complete one step. Our 
assumptions show that the amount of time required 
to complete a simulation cycle is bounded above by a 
constant times fg. Note that /, g, and hence fg are 
random variables. We let K denote the maximum 
size of any set consisting of a PE, its neighbors, and 
their neighbors. K is bounded by the assumption 
that the set of neighbors is bounded in size. 

We write x < s y for real- valued random variables 
x and y if 

P(x > z) < ¥{y > z) (6.1) 



holds for any real number z. 

Under the assumptions delineated in the preceding 



section, we derive the following results. 

Theorem 1 If the average number of events in a PE 
during an iteration is logiV, then for each constant 
C>1, 

f< s CelogN + Y, (6.2) 

where Y is a geometrically distributed random vari- 
able with parameter 1/C. 

In particular, this theorem shows that E{f) < 
CelogN + 1/(C - 1), and VVar(f) < CelogN + 
2/(C-l) 2 . 

We also find the following. 

Theorem 2 If the average number of events in a PE 
during an iteration is \ogN, then for each constant 
C > 1 we find 



g < s CeK\ogN + X, 



(6.3) 



where X is a geometrically distributed random vari- 
able with parameter 1/C. 

In particular, this theorem shows that E{g) < 
CeK log N + 1/(C - 1) and ^Var( 5 ) < CeK log N + 
V(C \r- 

Combining these two theorems, we can bound the 
average time for a complete simulation cycle of length 
T max - We may find a constant c such that, for large 
enough N, by Schwarz's inequality, 



E(fg) < VE(P)E(g 2 ) < (clog TV) 2 



(6.4) 



Since our assumptions show that the amount of time 
to complete a simulation cycle is no more than pro- 
portional to fg, equation (|6.4() gives our bound on 
the mean real time required to simulate time T max = 
logiV. Furthermore our assumptions show that in 
time T max = log N we expect to have to within a con- 
stant factor NT max \Gi\Xpi events simulated. There- 
fore, the number of events simulated per unit time is, 
to within a factor c, 



# events NlogN N 

= c = c 

unit time (log TV) 2 logiV 1 



(6.5) 



which is the efficiency estimate we promised. 

We summarize our efficiency result as a theorem. 

Theorem 3 Under the assumptions, the efficiency 
of the simulation is at least of order 1 / log N as N — > 
oo. Specifically, let r(N,t) be the amount of real time 
it takes to simulate an interval of time (0, t) with N 
processors. Then there is a constant B such that 



r(N,t) 



(i + l)logJV 



< B. 



(6.6) 
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We note that it can be shown that our scal- 
ing is optimal, in the sense that by taking a func- 
tion T max = r(N) in place of log AT, where either 
r(N)/ log N -> oo or r(N)/ log N -> as N -» oo, 
gives strictly worse efficiency. The details of all these 
results are in the full paper |13|. 
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